Clumppling: cluster matching and permutation program with integer linear programming

Abstract Motivation In the mixed-membership unsupervised clustering analyses commonly used in population genetics, multiple replicate data analyses can differ in their clustering solutions. Combinatorial algorithms assist in aligning clustering outputs from multiple replicates so that clustering solutions can be interpreted and combined across replicates. Although several algorithms have been introduced, challenges exist in achieving optimal alignments and performing alignments in reasonable computation time. Results We present Clumppling, a method for aligning replicate solutions in mixed-membership unsupervised clustering. The method uses integer linear programming for finding optimal alignments, embedding the cluster alignment problem in standard combinatorial optimization frameworks. In example analyses, we find that it achieves solutions with preferred values of a desired objective function relative to those achieved by Pong and that it proceeds with less computation time than Clumpak. It is also the first method to permit alignments across replicates with multiple arbitrary values of the number of clusters K. Availability and implementation Clumppling is available at https://github.com/PopGenClustering/Clumppling.


Introduction
Population-genetic mixed-membership unsupervised clustering methods, such as Structure (Pritchard et al. 2000) and Admixture (Alexander et al. 2009), are essential tools for understanding patterns of genetic variation in populations.These methods make use of individual genomes to infer population-genetic clusters; each individual is assigned a membership vector, in which each entry represents the inferred membership of the individual in a specific cluster.
Mixed-membership unsupervised clustering typically employs stochastic steps so that output memberships from independent runs of a clustering approach on the same set of individuals can differ.Membership differences across replicate analyses with the same settings can result from one of two sources.One source is label-switching, in which the clusters and the patterns of co-clustering of individuals across clusters are identical between runs, but the clusters are ordered differently across runs.The second source is genuine multimodality, in which the clustering algorithm yields multiple local optima that represent different patterns of coclustering of individuals.The clustering methods are often employed with different settings, most notably for the number of clusters, denoted K, so that membership estimates in different runs can differ for additional reasons.Because multiple potential sources contribute to clustering differences among replicates, in mixed-membership unsupervised clustering analysis, it is important to align the clusters across these replicates-to resolve label-switching, to assess genuine multimodality with fixed settings, and to examine clustering solutions with different values of K.
Three main methods exist for resolving the cluster alignment across runs of population-genetic unsupervised clustering: Clumpp (Jakobsson and Rosenberg 2007), Clumpak (Kopelman et al. 2015), and Pong (Behr et al. 2016).The methods differ slightly in the precise cluster alignment problems they solve, the algorithmic rationale for their alignment solutions, and their computational performance.Our goal is to introduce a new method that expands the set of scenarios in which cluster alignment can be performed, connects cluster alignment in population genetics to classic techniques of combinatorial optimization, and introduces computational improvements.The new method is Clumppling: CLUster Matching and Permutation Program with integer Linear programmING.
2 Review of existing approaches

Clumpp
Clumpp (Jakobsson and Rosenberg 2007) aligns replicates with equally many clusters.It either enumerates all possible permutations of clusters for all replicates and returns the one that maximizes an average pairwise similarity between all replicates, or adopts a greedy algorithm to sequentially align successive replicates.The greedy algorithm reduces computational time but does not necessarily find the optimal alignment.To further speed up the alignment process, Clumpp has a "LargeKGreedy" method to sequentially align a replicate column-wise rather than all at once.

Clumpak
Clumpak (Kopelman et al. 2015) extends beyond Clumpp in two ways: algorithmic mode detection and algorithmic alignment of replicates with different numbers of clusters.First, it uses Clumpp to align replicates with the same K.It then detects modes by constructing a similarity network from pairs of aligned replicates and using the Markov clustering algorithm to define nearly identical groups of replicates in the network: modes.A mode is represented by the mean memberships of replicates in the mode.
To assess replicates with different numbers of clusters, Clumpak examines the alignment of modes with K þ 1 clusters to those with K clusters.It aligns replicates with K þ 1 and K clusters by including an empty cluster and solving a one-to-one matching problem.

Pong
Pong (Behr et al. 2016) views the problem of aligning pairs of replicates with equally many clusters as an assignment problem (Burkard et al. 2009).It uses the polynomial-time Hungarian algorithm for this optimization.
For mode detection among replicates with a fixed number of clusters, Pong follows a simple approach.It constructs a similarity network of aligned replicates and uses a userspecified threshold to remove edges, taking the disjoint cliques in the network as modes.A representative replicate is then randomly chosen from each clique to represent the mode.
To align a pair of representative replicates of the major modes with numbers of clusters K and K þ 1, Pong considers mergings of every possible pair of clusters from the replicate with K þ 1 clusters and finds the optimal alignment from all one-to-one alignments.It proceeds through consecutive K values to align major modes across different K values.

Improvements provided by Clumppling
The existing methods have been used in thousands of studies.Nevertheless, opportunities exist for more fully integrating cluster alignment methods with frameworks of combinatorial optimization and network theory, for addressing complex alignment scenarios, and for improving computation time.We highlight several desirable features of Clumppling (Table 1

Overview
In the application of mixed-membership unsupervised clustering, the first step is to obtain multiple clustering replicates at each of multiple values of the number of clusters K. Beginning from these replicates, the procedure of Clumppling to align replicates and to extract modes is as follows: 1) Group replicates according to the number of clusters K.
2) For each group of replicates with a shared K: a) For each pair of replicates in the group, obtain an optimal alignment with minimal pairwise cost.b) Detect subgroups of replicates belonging to shared modes.c) Obtain the consensus membership of each mode.
3) Align pairs of modes across different values of K using their consensus memberships.4) Visualize the aligned modes.This pipeline follows that of Clumpak and Pong (Clumpp does not perform steps 2b, 2c, or 3, and its step analogous to 2a does not involve finding all pairwise alignments).However, Clumpak and Pong address only a special case of step 3 in which pairs of modes to be aligned across K values are the major modes of replicates with K and K þ 1 clusters.Clumppling provides alignments between each pair of modes, major and minor, considering replicates that can have numbers of clusters that differ by more than one.This new step is informative on how major and minor modes relate across K values and can also assist in aggregating clustering results with large K in which nonconsecutive K values may be of interest.
For steps 2a and 2b, the central steps of the alignment procedure, we seek to improve performance and run time over previous methods.First, for pairwise alignment of replicates in step 2a, we use integer linear programming (ILP) from optimization theory.Second, for community detection in step 2b, we use the Louvain algorithm from network theory.

Initial setup: dissimilarity between replicates
Consider two replicates from a clustering algorithm on N individuals.Replicate 1, with K 1 clusters, can be represented as a matrix Q of size N Â K 1 .Entry q ik is the inferred membership coefficient of individual i in cluster k.Replicate 2, with K 2 clusters, is a matrix P of size N Â K 2 .Without loss of generality, suppose K 1 !K 2 .
To align two replicates, we need a measure that quantifies the similarity or dissimilarity of matrices Q and P. The problem of aligning the replicates can then be formulated as a problem of maximizing the similarity, or minimizing the dissimilarity, between membership matrices, one of whose columns is rearranged according to various proposed alignments.
For K 1 ¼ K 2 ¼ K, Clumpp uses a pairwise similarity between two membership matrices, defined with the Frobenius matrix norm jj Á jj F as Clumpp seeks to find the optimal alignment of R replicates by maximizing a measure of mean pairwise similarity of the R replicates, termed H 0 .Clumpak uses this same method for the case of Equation (1) applies only to membership matrices of the same size.For K 1 > K 2 , we can consider measures that decompose the calculation into two levels: similarity or dissimilarity first between clusters, one from one replicate and one from the other, and second, between replicates.
Pong uses this two-level idea to define the similarity between replicates.Let q Ái denote the membership coefficients in cluster i of replicate 1, with entries q 'i for ' ¼ 1; 2; . . .; N; q Ái is the ith column of matrix Q.Similarly, let p Áj denote the membership coefficients in cluster j of replicate 2. A cluster similarity is derived from the Jaccard index on the overlap in membership coefficients between clusters q Ái and p Áj as where N Ã ¼ f' 2 f1; 2; . . .; Ng : q 'i þ p 'j > 0g.N Ã is the set of rows with nonzero membership in cluster i of Q, cluster j of P, or both.The similarity between replicates is then defined as the mean cluster similarity across all clusters for a pair of replicates: where j 0 is the cluster in P to which cluster i in Q aligns.
If K 1 ¼ K 2 ¼ K and all entries of the membership matrices are nonzero, i.e.N Ã ¼ N, then J has a form close to G 0 , though not quite equal to it.Equations ( 1) and (3) can be rewritten as follows: G 0 ranges from 0 to 1. G 0 ¼ 1 trivially if P ¼ Q. G 0 ¼ 0 if for all ' ¼ 1; 2; . . .; N, p 'k ¼ 1 and q 'k 0 ¼ 1 for some k 0 6 ¼ k.In this case, ðp 'k À q 'k Þ 2 ¼ 1 for 2 N of the NK pairs ð'; kÞ.However, the similarity measure J used by Pong does not reach 0. Because For Clumppling, we seek a dissimilarity measure for membership matrices that (1) permits K 1 > K 2 , and (2) spans the full unit interval [0, 1], with a value of 0 for matrices with no overlap and a value of 1 for identical memberships.We use a measure with the two-level composition of Pong but with a form more similar to that of Clumpp and Clumpak.
For the dissimilarity between cluster i of replicate 1 and cluster j of replicate 2, Clumppling uses For the dissimilarity between two replicates with Continuing with K 1 ¼ K 2 , dissimilarity DðQ; PÞ is related to similarity G 0 [Equations ( 1) and (4)]: We will see shortly how to proceed if K 1 > K 2 .

Step 2a: pairwise alignment
Given a dissimilarity measure, the alignment of two replicates involves permuting the clusters of one replicate-the columns of its associated matrix-to minimize the dissimilarity with the other replicate.
If K 1 ¼ K 2 ¼ K, then aligning two replicates is the problem of finding the optimal one-to-one permutation that minimizes DðQ; aðPÞÞ, where aðPÞ is the matrix P with columns permuted under a permutation a of ½K ¼ f1; 2; . . .; Kg.Minimizing the dissimilarity is equivalent to maximizing the similarity, the problem considered by Clumpp.
Linear programming (LP) concerns the problem of maximizing or minimizing a linear objective function subject to linear equality and inequality constraints (Schrijver 1998).These constraints form a feasible region of a convex polyhedron for variables that are optimized.LP problems are represented in canonical form by min x fc T xjAx b; x !0g, where x ¼ ðx 1 ; x 2 ; . . .; x n Þ T records the n variables to be optimized, c T x is the objective function (cost function) to be minimized, and Ax b and x !0 summarize the linear constraints.ILP problems have additional constraints that some variables are integers; in those dimensions, the feasible set for the variables is restricted to lattice points in the polyhedron.An ILP problem in which all variables must be integers can be represented in canonical form min x fc T xjAx b; x !0; x 2 Z n g.Although ILP problems are NP-complete (Schrijver 1998), we can capitalize on extensive effort devoted to solving them as standard problems in optimization theory.
To formulate the pairwise alignment problem with ILP, we place dissimilarities Cðq Ái ; p Áj Þ between pairs of clusters in two replicates in a K 1 Â K 2 matrix C. Denote the alignment a as a K 1 Â K 2 indicator matrix W, where Because a is a many-to-one mapping, each row of W in constrained to sum to exactly 1, and each column of W has sum at least 1.
The dissimilarity between two replicates in Equation ( 9) can be written as follows: The alignment problem can now be formulated with ILP: In fact, this minimization is an instance of binary linear programming (Wolsey 2020), in which variables are restricted to zeros and ones.The canonical form of this ILP problem appears in Supplementary Methods.A pairwise alignment problem framed in this manner can then be solved by standard ILP methods.Clumppling uses the branch-and-cut algorithm (Mitchell 2002).The optimal solution w Ã that minimizes the objective function corresponds to the optimal alignment under the chosen dissimilarity measure [Equation ( 12), as reformulated in Supplementary Equation ( 1)].

Step 2b: mode detection via community detection
For all replicates with equally many clusters, an optimal alignment is obtained for each pair using ILP as described in the previous section.Suppose there are R K replicates with K clusters, and their membership matrices are Q Between a pair of replicates i and j, the optimal alignment of To align all replicates simultaneously, Clumppling constructs an undirected graph G K ¼ ðV; EÞ using replicates as nodes V ¼ f1; 2; . . .; R K g.Edge set E ¼ fði; jÞji ¼ 1; 2; . . .; R K ; j ¼ 1; 2; . . .; R K g has weights u ij negatively weighted by the normalized dissimilarity of optimal alignments: u ij ¼ 1 for i ¼ j, and for i 6 ¼ j, where A higher weight indicates greater similarity of two replicates under their optimal alignment.The R K Â R K matrix of edge weights is symmetric.
We seek to find sets of replicates that are collectively similar to one another when optimally aligned, or modes.For this task, we rely on community detection algorithms (Fortunato 2010, Javed et al. 2018).In a graph, communities are groups of nodes that are more densely connected within the group than outside the group.A community in the graph G K corresponds to a mode in the unsupervised clustering analysis.In terms of the associated edge-weight matrix, a matrix with nontrivial communities has a block-diagonal structure, with two or more blocks corresponding to communities.Entries within a block tend to exceed entries outside it for the associated rows and columns-indicating greater edge weights for node pairs within the same community than for pairs not in the same community.
Clumppling first tests a null hypothesis of no community structure.We use a test of Tokuda (2018), based on differences between (i) random symmetric matrices with a blockdiagonal community structure in which within-block off-diagonal entries are distributed differently from outside-block off-diagonal entries, and (ii) random symmetric matrices in which all off-diagonal entries are independently and identically distributed.Considering the largest and smallest eigenvalues of two transformed versions of the symmetric edge-weight matrix of the graph, the null hypothesis is rejected if one or both eigenvalues (of either matrix) lies outside specified intervals.In our application, if the null hypothesis of no community structure is rejected at p ¼ 0:01, then Clumppling proceeds to identify community structure in the undirected weighted graph G K .
For community detection, Clumpak uses Markov clustering (MCL) (Van Dongen 2000), employing a threshold to remove some edges with lower weights from the network-i.e. to replace smaller edge weights with values of 0-thereby reducing the density of edges.Clumppling instead uses the Louvain method (Blondel et al. 2008), which does not require premodification of the network.This algorithm has a "resolution" parameter that affects the size of the detected communities; larger values typically find smaller communities and more of them.Clumppling allows the user to specify its value, with a default of 1.
The outcome of mode detection via community detection algorithms is a set of communities of nodes, each of which corresponds to a subset of replicates that belong to the same mode.Modes are disjoint so that replicate each belongs to exactly one mode.Suppose m K communities are detected, which we denote It is possible for a mode to possess only one replicate, jM ' K j ¼ 1, although this "singleton" situation is somewhat unusual.

Step 2c: consensus memberships for modes
For each mode ', we obtain a consensus membership matrix in one of two ways.First, we obtain a mean membership matrix of its replicates: where, for simplicity, the Q ðiÞ K are treated as having already been aligned.
We also obtain a representative membership matrix, the matrix of the replicate that has the largest sum of edge weights within the community: K can be used as the consensus membership of replicates in the mode.Clumppling uses Q ' K as its default.Suppose the set of the distinct numbers of clusters is K.After building networks of replicates for each K, we obtain a set of modes together with consensus memberships:

3.6
Step 3: alignment of modes across K Finally, with modes defined, we align modes across values of K.In particular, we order the values of K 2 K in decreasing order: We then obtain a multipartite graph of the pairwise alignments between modes across different values of K.
For adjacent K values in K, K i and K j , where j ¼ i þ 1, there exist m Ki Â m Kj pairs of modes.For each such pair, we use ILP [Equation ( 12)] to align consensus memberships from Equation ( 16).The optimal dissimilarity is DðQ 'i Ki ; a Ã ðQ 'j Kj ÞÞ, where ' i 2 ½m Ki , ' j 2 ½m Kj , and a Ã is the mapping of clusters of Q Ki that produces the optimum.We then use these dissimilarities between modes as weights to create a bipartite graph between modes of K i and modes of K j .For instance, the weight between mode ' i of K i and mode ' j of K j can be set to where larger weights indicate closer alignments.Note that theoretically, the dissimilarity D can exceed 1 for a pair of modes with different numbers of clusters.However, such a situation requires pairs of clusters to be matched extremely poorly between different values of K, an unlikely scenario after optimal alignment.Hence, negative weights in Equation ( 17) are unlikely.
Combining the bipartite graphs between pairs of adjacent values of K, we obtain a jKj-partite graph representing alignments across modes with different numbers of clusters.We call our approach to the alignment of modes across K the "direct" approach.
For consecutive values of K, we also consider a second approach that modifies the method of Pong; we term this second approach the "merge" approach.This approach enumerates all possible ways to merge a pair of clusters from the mode with K þ 1 clusters to produce K clusters.It then uses our ILP method to align two matrices of the same size.A total of K þ 1 2 alignments are performed, and the one that achieves the smallest dissimilarity is chosen to be the optimal alignment between the two modes.Note that unlike the "direct" approach, which permits nonconsecutive K values, the "merge" approach requires the values of K to be consecutive.

Visualization
To display all modes at all numbers of clusters, we proceed sequentially in the order ðK 1 ; K 2 ; . . .
if their numbers of clusters are consecutive), we choose the most closely aligned pair of modes between them as "anchors."We then perform the following three steps: 1) All other modes with K i clusters are aligned to the K i anchor.2) All other modes with K iþ1 clusters are aligned to the K iþ1 anchor.
3) The modes of these two different K values are then aligned according to the alignment of the ðK i ; K iþ1 Þ anchor pair.
For example, consider three numbers of clusters ðK 1 ; K 2 ; K 3 Þ, each with two modes.Suppose the most closely aligned mode pair for and that for . First, we align modes of ðK 1 ; K 2 Þ: (1) K 1 -Mode 2 is aligned to the anchor K 1 -Mode 1 .
(3) Clusters in modes of these two K values are rearranged according to the alignment between K 1 -Mode 1 and K 2 -Mode 2 .Next, we align modes of ðK 2 ; K 3 Þ in the same way: (1) K 2 -Mode 2 is aligned to the anchor K 2 -Mode 1 -which was already accomplished in the previous step (1), as alignments between modes with the same K apply symmetrically.(2) K 3 -Mode 1 is aligned to the anchor K 3 -Mode 2 .
(3) Across modes of K 2 and K 3 , the alignment follows that between K 2 -Mode 1 and K 3 -Mode 2 .In this way, all modes across ðK 1 ; K 2 ; K 3 Þ are aligned.
For visualization of aligned modes, Clumppling plots each mode as a "classic" structure plot-a stacked bar chart of equal height-with clusters represented by different colors (e.g.Rosenberg 2004).The number of replicates in a mode is marked above the associated plot.To visualize relationships between modes with different numbers of clusters, modes appear in a multipartite graph according to the across-K alignment.Modes with the same K appear in the same "layer," where the number of layers is K, the number of distinct K values considered.For each K, modes are ordered in decreasing size (i.e. the number of replicates in the mode) and decreasing within-mode similarity [described in Equation ( 18)].Modes with adjacent K values are joined by edges colored based on the edge weight [Equation ( 17)], with darker colors indicating larger weights and closer alignments.Minimal dissimilarities between pairs of modes under their optimal alignment-i.e. the optimal objective function values from Equation ( 12)-appear as labels on the edges, with smaller values indicating closer alignments.To visualize the variability within a mode, in addition to the structure plot, Clumppling provides a histogram of pairwise dissimilarities under optimal alignment of replicates within modes (Supplementary Fig. S1).

Evaluation of performance
The Clumppling implementation is described in the Supplementary Methods.We compare the alignment performance of Clumppling to the two existing methods, Clumpak and Pong, that align replicates across K values.Because they only align equal or consecutive K values, the evaluation does so as well-although Clumppling accommodates replicates with numbers of clusters differing by more than 1.

Performance measure
For replicates with a shared K, we use a performance measure based on the similarity score H 0 of Clumpp and Clumpak.For a mode M ' K , let H 0 ' denote the mean similarity score for all pairs of replicates in the mode after its replicates Q 1 ; Q 2 ; . . .Q jM ' K j are all aligned pairwise: where a Ã is interpreted as the mapping of clusters of Q j to clusters of Q i for the optimal alignment of Q i and Q j .
Next, to obtain a mean similarity score involving all replicates and modes, we calculate a weighted similarity score H that assigns each replicate the mean similarity score of its associated mode: We modify Equation ( 19) to exclude singleton modes ' with jM ' K j ¼ 1.Because singletons always have similarity score 1 in Equation ( 18), their existence can upwardly bias the weighted similarity score.With s K singleton modes, the singleton-excluded weighted similarity score becomes Note that removal of the singletons reduces the number of replicates that need to be aligned; it is useful to track the value s K along with HK .
Because we compare the performance of Clumppling with methods that only support the alignment of replicates with consecutive values of K, we use a performance measure suited to consecutive K values.To evaluate across-K alignments, we measure the G 0 similarity [Equation ( 1)] between the most closely aligned pair of modes aligned across each ðK; K þ 1Þ.For this computation, an optimal alignment a Ã has first already been identified for this pair of modes in Section 3.6.Two of the clusters in the mode with K þ 1 clusters are necessarily matched to the same cluster in the mode with K clusters, or a Ã ðiÞ ¼ a Ã ðjÞ for exactly one unordered pair of i; j 2 ½K þ 1, i 6 ¼ j.We apply the alignment a Ã : we merge these two clusters, adding their columns in the membership matrix together to produce a single column.Mathematically, suppose the original N Â ðK þ 1Þ membership matrix for the mode with K þ 1 clusters is P, with columns fp k g k2½Kþ1 .It now becomes a new N Â K membership matrix P 0 with columns p i ; for k ¼ 1; 2; . . .; K: Now the two membership matrices, Q and P 0 , both have size N Â K. G 0 ðQ; P 0 Þ is computed for these two matrices.Note that because this similarity calculation merges clusters, it measures the similarity of a mode with K clusters and a mode with K þ 1 clusters according to its value for a quantity optimized by the "merge" and not the "direct" approach.Hence, it is expected to have higher values when the "merge" rather than the "direct" approach is used to produce the optimal alignment.Because Pong uses the "merge" approach to align modes with consecutive numbers of clusters, the calculation evaluates Clumppling against Pong by a measure that Pong seeks to optimize.
We evaluate the performances of all four possible combinations of the mode consensus approach and the approach for performing alignments.That is, we use either "representative" [Equation ( 14)] or "average" [Equation ( 15)] as the consensus membership of a mode.For alignments of modes with K and K þ 1 clusters, we use either the "direct" approach or the "merge" approach in identifying the alignment with the minimal dissimilarity [Equation ( 9)].

Datasets
We demonstrate the use of Clumppling and compare the alignment performance of the methods with two datasets.The first contains an unsupervised Admixture analysis of 399 individuals focused on the human population of Cape Verde.This dataset has replicates with small values of K.It provides an example in which many individuals, including 44 individuals in the admixed population of Cape Verde (Verdu et al. 2017), possess nontrivial memberships in multiple clusters; the original analysis of Verdu et al. (2017) considered 50 replicates each at K ¼ 2; 3; 4; 5; we reanalyzed those replicates.For this dataset, we ran Clumppling using the default resolution parameter of 1 for the Louvain mode detection.
The second dataset provides an example of alignment with relatively large values of K: a study of 600 chickens from 20 populations (Rosenberg et al. 2001) focused on values of K ¼ 17; 18; 19, and we add K ¼ 20; 21 here ("chicken dataset").We begin from the original data, 27 genotyped loci in each of the 600 individuals, running Structure (Pritchard et al. 2000) for 20 replicates for each K from 17 to 21.We ran Structure with a burn-in period of length 5000 in the "Admixture" model followed by 50000 MCMC repetitions, as in Rosenberg et al. (2001).For mode detection in Clumppling, we used 1.05 for the resolution parameter; for this more challenging dataset, increasing the resolution above the default of 1 leads to a larger number of modes but with greater within-mode similarity.

Analysis of clustering replicates
For each value of K, we used Clumppling to align the 50 Admixture replicates for the Cape Verde dataset.Alignments based on mean memberships for the mode consensus and the "direct" approach for alignment across K values appear in Fig. 1.
For the chicken dataset, for each K, alignments based on mean memberships for the mode consensus and the "direct" approach for alignment across K values appear in Fig. 2.An additional analysis of alignments across non-consecutive K values appears in Supplementary Fig. S2.
For comparison, we ran Clumpak and Pong.Clumpak uses the LargeKGreedy algorithm of Clumpp to align replicates for fixed K values.In the MCL algorithm, it uses a threshold automatically generated from the graph properties to control inclusion of edges of the graph.It uses the Distruct for many K's feature to align single results-major modes-for different K values.Clumpak alignment results appear in Supplementary Fig. S3.
We ran Pong (Behr et al. 2016) using its sum-squared distance metric for calculating the similarity between clusters; we chose the sum-squared distance rather than the Jaccard similarity in Pong, as it is closer to the objective used by Clumppling.A fixed threshold of 0.9 is chosen for the Cape Verde data and 0.955 for the chicken data to exclude edges with weights below the threshold from the pairwise similarity graph of replicates for mode detection.We chose these values to be lower than the Pong default of 0.97 in order to avoid producing large numbers of singleton modes.When identifying disjoint cliques in the graph of pairwise similarities for mode detection, we used the Pong default greedy approach to iteratively remove the maximal clique from the graph if no disjoint cliques are found.Alignment results for Pong appear in Supplementary Figs S4 and S5.
For within-K alignments, the performance of Clumppling, Clumpak, and Pong appears for Cape Verde in Supplementary Table S1; Supplementary Table S2 shows the numbers of replicates detected within modes by the three methods for the two datasets.Supplementary Tables S3 and  S4 show corresponding results for the chicken dataset.For between-K alignments, Clumppling is evaluated in each of four  12).Each structure plot displays a mode, where the modes for the same K appear in decreasing order by their size-marked in parentheses above the top right corner of each plot-and then their within-mode similarity (if there is a tie in size).
combinations: using representative or average memberships, and using the "merge" and "direct" approaches to alignment.The performance of Clumppling, Clumpak, and Pong in across-K alignments for the Cape Verde dataset appears in Supplementary Table S5 and for the chicken dataset in Supplementary Table S6.

Performance for within-K alignment
In the Cape Verde dataset, for K ¼ 2 and K ¼ 3, all three methods find a single mode (Supplementary Table S2).No singletons are observed, and the singleton-excluded weighted similarity score H is 1 for all three methods (Supplementary Table S1).For K ¼ 4, although Clumpak gives the largest score, this score discards 11 singletons.For K ¼ 5, Clumppling has the fewest modes and the largest withinmode similarity between replicates.
In the chicken dataset, at different values of K, Clumppling achieves consistently greater values of the singleton-excluded similarity score (Supplementary Table S3).It does so while finding no singleton modes; the other two methods both identify singletons at most values of K (Supplementary Table S4).

Performance for across-K alignment
In the Cape Verde dataset, across-K alignments have comparable performance for the various methods, with high similarity scores (Supplementary Table S5).In the more challenging chicken dataset, with larger values of K, Clumppling produces the largest similarity scores between the most closely aligned pair of modes at consecutive values of K (Supplementary Table S6).The choice of the "representative" approach for mode consensus and the "merge" approach for cluster alignment gives the highest scores, but the three other choices all produce large values for the score as well.Clumpak and Pong achieve comparably high scores only in one of four transitions between K values (20-21 for Clumpak, 17-18 for Pong).

Run time
Run time with Clumppling (using the "direct" approach) is comparable to Pong; both are faster than Clumpak.In Clumppling, the "direct" approach is faster than the "merge" approach.A detailed comparison of the run time for the three methods appears in Supplementary Results and Supplementary Table S7.

Discussion
Clumppling is a new method for aligning replicate mixedmembership unsupervised clustering analyses.Building upon Clumpp (Jakobsson and Rosenberg 2007), Clumpak (Kopelman et al. 2015), and Pong (Behr et al. 2016), it performs alignment tasks that have not been addressed by earlier methods-alignment of all modes for one value of K with all modes of another value of K, and alignment of modes across nonconsecutive K values.Clumppling applies algorithms from combinatorial optimization and network theory in producing similar alignments to those obtained by the other methods (Figs 1 and 2 and Supplementary Figs S3-S5), often with higher values of a similarity score for replicates within modes at fixed K (Supplementary Tables S1 and S3) or modes at consecutive K values (Supplementary Tables S5 and S6), and in comparable or reduced computation time (Supplementary Table S7).Thus, it compares favorably with other methods in terms of its novel features, algorithmic justification, performance measures, and run time.

Algorithmic innovations
Clumppling introduces methodological advances for cluster alignment.Though Pong previously described the alignment problem as a standard combinatorial optimization problem, the Clumppling ILP formulation of the alignment of two replicates allows it to capitalize on efficiencies of ILP solvers.Hence, Clumppling is comparable in speed to Pong.
The ILP formulation also enables Clumppling to address new scenarios not covered by Pong.In particular, the many-to-one matching that it permits for clustering replicates with different values of the number of clusters makes it possible for Clumppling to perform alignments across values of K, including across nonconsecutive values.Such alignments can be useful, e.g. in analyses for which K extends over a large range (Funk et al. 2020); for large datasets in which computation is slow, it may be sensible to perform exploratory clustering with select values of K-say, every fifth value-to then summarize with Clumppling, and only then, if necessary, to consider consecutive K in a meaningful range.
Finally, combining the advance from Pong in formulating cluster alignment in terms of a classic setting in optimization with the advance from Clumpak of using community detection algorithms, Clumppling is able to perform a more comprehensive analysis of all observed modes.In particular, Clumppling aligns all modes across K values, rather than aligning only single modes at each K value, as in Clumpak and Pong.This alignment is informative to clarify clustering patterns in scenarios in which multimodality arises as K is increased but a major mode reappears as K is increased still further (e.g.Fig. 7 of Wang et al. 2007).
The formalization of the cluster alignment problem here establishes a framework for further enhancement.By clarifying the components of the problem-pairwise alignment of replicates at fixed K (step 2a), mode detection at fixed K (step 2b), defining the consensus of modes (step 2c), and aligning modes across K (step 3)-each component can be separately investigated.Optimization methods other than the ILP branch-and-cut algorithm and network-based clustering methods other than the Louvain algorithm can be further tested for improvements in their associated steps.

Empirical performance
The performance differences in our empirical examples are relatively small.In the Cape Verde example, alignments were clear across the methods, all of which performed comparably (Supplementary Tables S1, S2, and S5).In our more difficult chicken example, Clumppling produced the highest value for the mean similarity of replicates within modes (Supplementary Table S3), identified the fewest singletons (Supplementary Table S4), and produced the highest similarity scores between modes with consecutive values of K (Supplementary Table S6).In both cases, the modes themselves are similar across methods (Fig. 1 and Supplementary Figs S3A, S4A, and S5A for Cape Verde, and Fig. 2 and Supplementary Figs S3B, S4B, and S5B for chickens).
Together, Clumpp, Clumpak, and Pong have been widely used, all performing well in typical empirical settings.When clustering algorithms uncover clear structure-e.g. with replicable co-clustering of some individuals in one cluster and other individuals in another-the proper alignment is often clear, and it is likely to be found by all methods.In a modeling study describing alignment cost under a Dirichlet model, we have found that a correct permutation often has cost far below that of the other permutations (Liu et al. 2022).
Clumppling can be added to the list of methods that can be used to find this permutation.

Methodological choices and extensions
In developing Clumppling, we have made decisions about a number of methodological trade-offs.For the pairwise alignment step, Pong previously used the polynomial-time Hungarian algorithm for alignments at a fixed value of K.In Clumppling, we have chosen to use ILP, which is not polynomial-time and can be slower than the approach of Pong, though still faster than Clumpak (Supplementary Table S7).However, ILP offers the ability to facilitate alignments across both consecutive and nonconsecutive values of K; Pong accommodates only consecutive values.
For the alignment cost function, our framework allows any dissimilarity measure between replicates as the objective for the ILP problem-provided that it is a linear combination of pairwise between-cluster dissimilarities.Our specific quadratic function of entries in two membership matrices is grounded in the analysis of Liu et al. (2023).In particular, if two replicates have the same number of clusters, then the dissimilarity that Clumppling seeks to minimize via ILP is exactly (a constant multiple of) the alignment cost from Liu et al. (2023).
Choices in community detection affect the granularity of the modes obtained.In one extreme, each replicate is its own mode; in the other, replicates with truly distinct co-clustering patterns are grouped in the same mode.Clumppling follows Clumpak in using an adaptable parameter for tuning this granularity.Nevertheless, it is possible that replicates visually distinguishable as belonging to distinct modes might be grouped.Such patterns can sometimes be diagnosed by the appearance of membership vectors that, within individual replicates, are near a permutation of ð1; 0; 0; . . .; 0Þ, but that are not near a simplex vertex in the mean of replicates within the mode.In applications, users can increment the "resolution" parameter, e.g. by 0.05 or 0.01, choosing larger values to increase granularity and smaller values to decrease it.

Conclusions
With its formulation of the cluster alignment problem in defined steps, combination of pairwise alignment ideas based on Pong and community detection based on Clumpak, and addition of new features for mode alignment across values of K, Clumppling can assist in the many cluster alignments that take place in population-genetic data analysis.Notably, however, mixed-membership clustering, also sometimes known as soft or fuzzy clustering, has broad applications beyond population genetics, including elsewhere in bioinformatics, as well as in image analysis, marketing, and text mining (De Oliveira andPedrycz 2007, Airoldi et al. 2014).Problems of comparing and visualizing multiple clustering results in a general context have perhaps been of greater interest for hard clustering (Meil a 2007, Zhou et al. 2009, L'Yi et al. 2015), in which memberships of individuals are assigned to single clusters.For similar problems of soft clustering, the methods from population genetics-Clumppling and its predecessors-are available.Clumppling can be applied to any membership-based clustering algorithms applied multiple times on the same set of entities, and it potentially has broad applications in diverse uses of mixed-membership cluster analysis.
then the maximal value of DðQ; aðPÞÞ can exceed 1.Nevertheless, for a specific pair ðQ; PÞ, a smaller value of DðQ; aðPÞÞ always indicates a closer alignment.

Figure 1 .
Figure1.Clumppling-aligned modes for the Cape Verde dataset (K from 2 to 5), using the mean memberships as mode consensus and the "direct" approach to alignment across K values.The multipartite graph shows the alignment across different K. Edges are colored by the edge weight [Equation (17)]; darker color indicates a larger weight and thus better alignment.The numbers on the edges are the optimal solutions for pairwise alignments, representing minimum values in Equation (12).Each structure plot displays a mode, where the modes for the same K appear in decreasing order by their size-marked in parentheses above the top right corner of each plot-and then their within-mode similarity (if there is a tie in size).

Figure 2 .
Figure 2. Clumppling-aligned modes for the chicken dataset (K from 17 to 21), using the mean memberships as mode consensus and the "direct" approach to alignment across K values.The figure design follows Figure 1.
It performs alignments in the settings considered by Clumpp, Clumpak, and Pong and also expands to new settings.
ments between all modes at successive values of K, not only the major modes.4)Unlike the other methods, Clumppling performs alignments between replicates that differ in number of clusters by more than one.Clumppling combines benefits of Clumpak and Pong in relying on ideas used for optimization and alignment in other fields.